Forecasting User Behavior for The Gym Chain

Alexander Feldman V.1.1

This project investigates user behavior for the gym chain Model Fitness.
The goals of the project:

  • Analyze the factors affecting customer churn.
  • Give recommendations regarding the strategy for customer interaction and retention.

1. Open the data file and read the general information.

In [1]:
#!pip install plotly --upgrade
In [2]:
# import libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly import graph_objects as go
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import KMeans
In [3]:
# open the dataset
try:
    data = pd.read_csv('/datasets/gym_churn_us.csv')  # path for working on the platform
except FileNotFoundError:
    data = pd.read_csv('datasets/gym_churn_us.csv')  # path for local work
In [4]:
data.head()
Out[4]:
gender Near_Location Partner Promo_friends Phone Contract_period Group_visits Age Avg_additional_charges_total Month_to_end_contract Lifetime Avg_class_frequency_total Avg_class_frequency_current_month Churn
0 1 1 1 1 0 6 1 29 14.227470 5.0 3 0.020398 0.000000 0
1 0 1 0 0 1 12 1 31 113.202938 12.0 7 1.922936 1.910244 0
2 0 1 1 0 1 1 0 28 129.448479 1.0 2 1.859098 1.736502 0
3 0 1 1 1 1 12 1 33 62.669863 12.0 2 3.205633 3.357215 0
4 1 1 1 1 1 1 0 26 198.362265 1.0 3 1.113884 1.120078 0
In [5]:
# standardize column names, round the charges, and cast contract months to integer
data.columns = data.columns.str.lower()
data['avg_additional_charges_total'] = data['avg_additional_charges_total'].round(2)
data['month_to_end_contract'] = data['month_to_end_contract'].astype(int)
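The cleaning steps above can be sanity-checked on a tiny hypothetical frame that mimics the raw schema (the column names and values below are illustrative, not the real dataset):

```python
import pandas as pd

# Hypothetical mini-frame mimicking the raw schema
raw = pd.DataFrame({
    'Avg_additional_charges_total': [14.227470, 113.202938],
    'Month_to_end_contract': [5.0, 12.0],
})

# Same cleaning as above: lower-case names, round charges, cast months to int
raw.columns = raw.columns.str.lower()
raw['avg_additional_charges_total'] = raw['avg_additional_charges_total'].round(2)
raw['month_to_end_contract'] = raw['month_to_end_contract'].astype(int)

print(raw['avg_additional_charges_total'].tolist())  # → [14.23, 113.2]
print(raw['month_to_end_contract'].dtype.kind)       # → i (integer)
```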

2. Exploratory data analysis (EDA)

In [6]:
# Check for missing values and look at summary statistics.
display(data.info(), data.describe())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 14 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   gender                             4000 non-null   int64  
 1   near_location                      4000 non-null   int64  
 2   partner                            4000 non-null   int64  
 3   promo_friends                      4000 non-null   int64  
 4   phone                              4000 non-null   int64  
 5   contract_period                    4000 non-null   int64  
 6   group_visits                       4000 non-null   int64  
 7   age                                4000 non-null   int64  
 8   avg_additional_charges_total       4000 non-null   float64
 9   month_to_end_contract              4000 non-null   int32  
 10  lifetime                           4000 non-null   int64  
 11  avg_class_frequency_total          4000 non-null   float64
 12  avg_class_frequency_current_month  4000 non-null   float64
 13  churn                              4000 non-null   int64  
dtypes: float64(3), int32(1), int64(10)
memory usage: 422.0 KB
None
gender near_location partner promo_friends phone contract_period group_visits age avg_additional_charges_total month_to_end_contract lifetime avg_class_frequency_total avg_class_frequency_current_month churn
count 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000 4000.000000
mean 0.510250 0.845250 0.486750 0.308500 0.903500 4.681250 0.412250 29.184250 146.943730 4.322750 3.724750 1.879020 1.767052 0.265250
std 0.499957 0.361711 0.499887 0.461932 0.295313 4.549706 0.492301 3.258367 96.355654 4.191297 3.749267 0.972245 1.052906 0.441521
min 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 18.000000 0.150000 1.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 1.000000 0.000000 0.000000 1.000000 1.000000 0.000000 27.000000 68.865000 1.000000 1.000000 1.180875 0.963003 0.000000
50% 1.000000 1.000000 0.000000 0.000000 1.000000 1.000000 0.000000 29.000000 136.220000 1.000000 3.000000 1.832768 1.719574 0.000000
75% 1.000000 1.000000 1.000000 1.000000 1.000000 6.000000 1.000000 31.000000 210.947500 6.000000 5.000000 2.536078 2.510336 1.000000
max 1.000000 1.000000 1.000000 1.000000 1.000000 12.000000 1.000000 41.000000 552.590000 12.000000 31.000000 6.023668 6.146783 1.000000

There are no missing values. The 'avg_additional_charges_total' and 'lifetime' columns are right-skewed rather than normally distributed.
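The skew can be quantified with pandas' `.skew()`, which is near zero for symmetric data and positive for a right tail. A minimal sketch on synthetic data (an exponential sample standing in for a right-skewed column such as 'lifetime'; not the real data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed sample standing in for a column like 'lifetime'
sample = pd.Series(rng.exponential(scale=3.7, size=4000))

# A clearly positive skewness confirms the right tail
print(round(sample.skew(), 2))
```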

In [7]:
# Look at the mean feature values in churn and stayed groups.
group_avg = data.groupby('churn', as_index=False).mean()
group_avg
Out[7]:
churn gender near_location partner promo_friends phone contract_period group_visits age avg_additional_charges_total month_to_end_contract lifetime avg_class_frequency_total avg_class_frequency_current_month
0 0 0.510037 0.873086 0.534195 0.353522 0.903709 5.747193 0.464103 29.976523 158.445716 5.283089 4.711807 2.024876 2.027882
1 1 0.510839 0.768143 0.355325 0.183789 0.902922 1.728558 0.268615 26.989632 115.082903 1.662582 0.990575 1.474995 1.044546

As we can see from the table, several features differ noticeably on average between the churned and stayed groups: partner, promo_friends, contract_period, group_visits, lifetime, and others.

In [8]:
# count values of churn column
churn_count = data.groupby('churn')['gender'].count().reset_index()
churn_count = churn_count.rename(columns={'churn':'status', 'gender':'n_users'})
In [9]:
# Plot a pie chart of the churned and stayed user shares
fig = px.pie(churn_count, values='n_users', names=['Stayed users', 'Churned users'], 
            color_discrete_sequence=px.colors.qualitative.Set2)
fig.update_traces(textinfo='value+percent') 
fig.update_layout(title={'text':'Churned users and Stayed users', 'x':0.5})

fig.show()

As the pie chart shows, 26.5% of users churned. This is a fairly high churn rate, so the business clearly has a retention problem.

In [10]:
# plot distributions of features for stayed and churned users
data_feats = data.drop('churn', axis=1)
fig = make_subplots(rows=5, cols=3, subplot_titles=("Gender", "Near location", "Partners program", 
                                                    "Friends promo", "Is there a phone?", "Contract period", 
                                                    "Group visits", "Age", "Total additional charges",
                                                    "Months to end contract", "Lifetime", "Visits frequency total",
                                                    "Visits frequency in current month"))
col=1
row=1
statuses=['Stayed', 'Churned']
colors=px.colors.qualitative.Set2
for label, content in data_feats.items():
    for ch in range(0,2):
        status = statuses[ch]
        color = colors[ch+2]
        if label == 'gender':
            fig.add_trace(go.Histogram(x=data.query('churn == @ch')[label], 
                                                   name=status, text=status, 
                                                   marker={'color':color}), row, col)
            fig.update_xaxes(tickvals = [0,1], ticktext=['Female','Male'] , row=row, col=col)
        else:
            fig.add_trace(go.Histogram(x=data.query('churn == @ch')[label], name=status, 
                                                   text=status, showlegend=False, 
                                                   marker={'color':color}), row, col)
                   
            if label in ['near_location', 'partner', 'promo_friends', 'phone', 'group_visits']:
                fig.update_xaxes(tickvals = [0,1], ticktext=['No','Yes'], row=row, col=col)
            if label in ['contract_period', 'month_to_end_contract']:
                fig.update_xaxes(tickvals = [1,6,12], row=row, col=col)        
    if col<3:
        col+=1
    elif col==3:
        col=1
        row+=1            
    
fig.update_layout(title={'text':'Distributions of features for stayed and churned users', 'x':0.5}, 
                  height=1500, bargap=0.1, barmode='group')
fig.show()

As a rule, churned users have short contracts and lifetimes, do not take part in the partner and promo programs, and visit the gym less frequently.

In [11]:
#Build a correlation matrix
plt.figure(figsize=(12,8))
ax = sns.heatmap(data_feats.corr(), cmap="Purples", annot=True)
plt.title('Correlation matrix of the features', fontdict={'size':15})
plt.show()

As the heatmap shows, there is a high correlation between contract_period and month_to_end_contract (0.97), and between avg_class_frequency_total and avg_class_frequency_current_month (0.95). We will account for this when building the predictive models.
Note also that the two marketing programs (partner and promo_friends) have a moderate correlation (0.45): users likely try one program, appreciate the benefits, and then take part in the other. Moreover, promo activity depends on contract length (a correlation of 0.31 with the partner program).
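Highly correlated columns can also be flagged programmatically rather than read off the heatmap. A sketch on a synthetic frame (the columns `a`, `b`, `c` and the 0.9 threshold are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic frame: 'a' and 'b' are nearly identical, 'c' is independent
a = rng.normal(size=500)
df = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.05, size=500),
    'c': rng.normal(size=500),
})

# Keep only the upper triangle so each pair is counted once,
# then flag columns whose correlation with an earlier column exceeds 0.9
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # → ['b']
```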

3. Build a model to predict user churn.

In [12]:
# Write a function for forecasting user churn by logistic regression and random forest methods

def predict_user_churn(X,y):
    # prepare data
    X_train, X_test, y_train, y_test =  train_test_split(X, y, test_size=0.2, random_state=0) 
    scaler = StandardScaler()
    X_train_st = scaler.fit_transform(X_train) 
    X_test_st = scaler.transform(X_test)

    # train and predict by Logistic Regression model
    lr_model = LogisticRegression(random_state=0)
    lr_model.fit(X_train_st, y_train)
    y_prediction_lr = lr_model.predict(X_test_st)

    # train and predict by Random Forest model
    rf_model = RandomForestClassifier(n_estimators = 100, random_state=0)
    rf_model.fit(X_train_st, y_train)
    y_prediction_rf = rf_model.predict(X_test_st)

    # print result
    result = pd.DataFrame(data={'Accuracy':[accuracy_score(y_test, y_prediction_lr), 
                                            accuracy_score(y_test, y_prediction_rf)],
                            'Precision':[precision_score(y_test, y_prediction_lr),
                                         precision_score(y_test, y_prediction_rf)],
                            'Recall':[recall_score(y_test, y_prediction_lr), 
                                      recall_score(y_test, y_prediction_rf)]},
                     index=['Logistic Regression', 'Random Forest'])
    
    return result.style.format('{:.2f}')
In [13]:
# Build a model to predict user churn by logistic regression and random forest methods.
predict_user_churn(X = data.drop('churn', axis=1),
                   y = data['churn'])
Out[13]:
Accuracy Precision Recall
Logistic Regression 0.92 0.85 0.83
Random Forest 0.91 0.83 0.82
In [14]:
# Build the churn model again, accounting for the correlation between features.
# Since two pairs of columns are highly correlated, drop one column from each pair to avoid multicollinearity.
predict_user_churn(X = data.drop(['churn','month_to_end_contract', 'avg_class_frequency_current_month'], axis=1),
                   y = data['churn'])
Out[14]:
Accuracy Precision Recall
Logistic Regression 0.90 0.79 0.81
Random Forest 0.89 0.79 0.77

As the first table shows, Logistic Regression achieves higher accuracy, precision, and recall than Random Forest.
After excluding the highly correlated fields, the metrics of both models drop slightly, and the accuracy of Logistic Regression becomes comparable to that of Random Forest.
Thus, of all the options considered, the most successful is Logistic Regression trained on the full feature set, without eliminating multicollinearity.
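The comparison above rests on a single train/test split, so cross-validation is a useful robustness check. A minimal sketch on synthetic classification data (the data and fold count are assumptions, not the Model Fitness dataset); putting the scaler inside a pipeline keeps it fitted only on each training fold, avoiding leakage:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features (not the real data)
X, y = make_classification(n_samples=1000, n_features=13, random_state=0)

# Scaler inside the pipeline is refit on each training fold
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(scores.mean().round(2))
```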

4. Create user clusters.

In [15]:
#  build a matrix of distances
X = data.drop('churn', axis=1)
sc = StandardScaler()
X_sc = sc.fit_transform(X)
linked = linkage(X_sc, method = 'ward') 
In [16]:
# plot a dendrogram
plt.figure(figsize=(12, 8))  
dendrogram(linked, orientation='top', no_labels=True)
plt.title('Hierarchical clustering for Model Fitness customers ')
plt.gca().spines["top"].set_alpha(0.0)    
plt.gca().spines["bottom"].set_alpha(0.5)
plt.gca().spines["right"].set_alpha(0.0)    
plt.gca().spines["left"].set_alpha(0.5)
plt.show()

As the dendrogram shows, the linkage algorithm suggests four distinct groups of users.
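Instead of reading the number of groups off the plot, the same linkage matrix can be cut into flat clusters with `scipy.cluster.hierarchy.fcluster`. A sketch on synthetic 2-D blobs (the four well-separated centers are an assumption, not the real feature matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Four well-separated synthetic blobs of 50 points each
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
pts = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

linked = linkage(pts, method='ward')
# Cut the dendrogram into exactly 4 flat clusters
labels = fcluster(linked, t=4, criterion='maxclust')
print(sorted(set(labels)))  # → [1, 2, 3, 4]
```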

In [17]:
# Train the clustering model with the K-means algorithm and predict customer clusters
km = KMeans(n_clusters = 5, random_state=0)
labels = km.fit_predict(X_sc)
data['cluster'] = labels
In [18]:
# Look at the mean feature values for clusters
mean_feature_cluster = data.groupby('cluster', as_index=False).mean()
mean_feature_cluster.round(2)
Out[18]:
cluster gender near_location partner promo_friends phone contract_period group_visits age avg_additional_charges_total month_to_end_contract lifetime avg_class_frequency_total avg_class_frequency_current_month churn
0 0 0.52 0.86 0.47 0.31 0.0 4.79 0.43 29.30 143.96 4.48 3.92 1.85 1.72 0.27
1 1 0.48 0.81 0.00 0.09 1.0 1.86 0.32 28.14 131.30 1.79 2.35 1.33 1.09 0.55
2 2 0.56 0.86 0.34 0.20 1.0 2.73 0.45 30.20 164.63 2.52 5.01 2.93 2.93 0.05
3 3 0.51 0.75 1.00 0.45 1.0 2.55 0.30 28.50 129.81 2.37 2.83 1.36 1.18 0.40
4 4 0.50 0.94 0.75 0.54 1.0 11.35 0.56 29.99 164.56 10.38 4.82 2.03 2.02 0.02
In [19]:
# Plot distributions of features for the clusters

feat_cluster = data.drop('cluster', axis=1)
fig = make_subplots(rows=5, cols=3, 
                    subplot_titles=("Gender", "Near location", "Partners program", 
                                    "Friends promo", "Is there a phone?", "Contract period", 
                                    "Group visits", "Age", "Total additional charges",
                                    "Months to end contract", "Lifetime","Visits frequency total",
                                    "Visits frequency in current month", "Churn users"))
col=1
row=1
colors = px.colors.qualitative.Set2
for label, content in feat_cluster.items():
    for clust in range(0,5):
        status = 'Cluster '+str(clust)
        color = colors[clust]
           
        if label == 'gender':
            fig.add_trace(go.Histogram(x=data.query('cluster == @clust')[label], 
                                                   name=status, text=status, 
                                                   marker={'color':color}), row, col)
            fig.update_xaxes(tickvals = [0,1], ticktext=['Female','Male'] , row=row, col=col)
        else:
            fig.add_trace(go.Histogram(x=data.query('cluster == @clust')[label], name=status, 
                                                   text=status, showlegend=False, 
                                                   marker={'color':color}), row, col)
                   
            if label in ['near_location', 'partner', 'promo_friends', 'phone', 'group_visits']:
                fig.update_xaxes(tickvals = [0,1], ticktext=['No','Yes'], row=row, col=col)
            if label in ['contract_period', 'month_to_end_contract']:
                fig.update_xaxes(tickvals = [1,6,12], row=row, col=col)
            if label == 'churn':
                fig.update_xaxes(tickvals = [0,1], ticktext=['Stay','Churn'], row=row, col=col)
            
    if col<3:
        col+=1
    elif col==3:
        col=1
        row+=1
fig.update_layout(title={'text':'Distributions of features by clusters', 'x':0.5}, 
                  height=1500, bargap=0.1, barmode='stack', legend_traceorder="normal")
fig.show()

We divided the clients into 5 clusters. Note that the dendrogram suggests a division into 4 clusters would be preferable, but the task specified 5.
Let's sort the clusters by degree of customer retention (using the mean-feature table above) and characterize each:

  • 1st place - Cluster 4. Users with the longest contracts (11.4 months on average) and the highest participation in the partner program (75%) and the promo friends program (54%); 94% live near the gym. The retention champion: only 2% of its customers churned (5th place by CR).
  • 2nd place - Cluster 2. On average, these users have the highest lifetime (about five months), the highest spending on additional services (about $165), and the highest visit frequency (2.9 times a week). The cluster has the largest share of men (56%). Only 5% of its users left the next month (4th place by CR).
  • 3rd place - Cluster 0. Users who did not provide a phone number; all other clusters did. The rest of the indicators are about average. Churn rate 27% (3rd place by CR).
  • 4th place - Cluster 3. The cluster with the lowest share of users living near the gym (75%); all of its users came through the partner program (100%), yet contracts are short (2.6 months on average) and visit frequency is low. Churn rate 40% (2nd place by CR).
  • 5th place - Cluster 1. No participation in the partner program (0%) and almost none in the promo friends program (9%), the shortest contracts (1.9 months), a low lifetime (2.4 months), and about one visit a week in the current month. The churn champion: 55% of its users left the next month (1st place by CR).
In [20]:
# Calculate the churn rate for each cluster
group_cr = data.groupby('cluster', as_index=False).agg({'churn':['sum','count']})
group_cr.columns=['cluster','churned_users','n_users']
group_cr['churn_rate'] = (group_cr['churned_users'] / group_cr['n_users']).round(2)
group_cr['cluster'] = group_cr['cluster'].astype(str)
In [21]:
fig = px.bar(group_cr, x="churn_rate", y="cluster", color='cluster', 
             orientation='h', text='churn_rate',
              color_discrete_sequence=px.colors.qualitative.Set2)
fig.update_traces(texttemplate='-%{text:.0%}', textposition='auto')
fig.update_layout(title={'text':'Churn rate by clusters', 'x':0.5})
fig.show()

5. Conclusions and recommendations.

So, we conducted an analysis whose results can help identify the key factors affecting customer churn and develop measures to increase user retention.
Based on the characteristics of the clusters, we can distinguish parameters that may signal a client's imminent departure:

  • Short contract term.
  • Small number of visits.
  • Remoteness from the gym.
  • Poor participation in promo programs.
  • Lack of a phone number on file.

As well as parameters that indicate customer satisfaction:

  • Long contract term.
  • High frequency of visits.

Many users from the groups with the maximum CR demonstrate the following behavior: they buy the minimum contract, go once a week, refuse group visits and, as a result, leave after a month.
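The risky behavior pattern above translates directly into a filter that could surface at-risk users for targeted retention offers. A sketch on a hypothetical mini-frame (the `user_id` values and thresholds are illustrative assumptions):

```python
import pandas as pd

# Hypothetical mini-frame: contract length (months) and weekly visit frequency
users = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'contract_period': [1, 12, 1, 6],
    'avg_class_frequency_current_month': [0.8, 2.5, 1.0, 2.0],
})

# Flag the risky pattern: minimum contract plus at most one visit a week
at_risk = users.query('contract_period == 1 and avg_class_frequency_current_month <= 1')
print(at_risk['user_id'].tolist())  # → [1, 3]
```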

Hypotheses why users leave:

  • Users lack the time or motivation to continue practicing.
  • Users are not satisfied with something in the gym, which discourages the desire to continue classes. For example: problems with parking, high congestion of the gym, poor service, poor equipment, and so on.
  • There are more attractive offers from competitors both in terms of price and quality of services.

Recommendations for the marketing department:

  1. Conduct a survey among clusters 0, 1 and 3 (those with the highest churn rates) to find out the degree of customer satisfaction and the specific reasons for dissatisfaction.
  2. Research the local service market. Compare competitors' offerings and service levels.
  3. Develop marketing materials aimed at increasing customer motivation to visit the gym. For example, brochures, videos, and mailing articles with inspiring success stories of people who exercise regularly.
  4. Cluster 2 has the lowest participation rates in the partner and promo friends programs. At the same time, the cluster shows a high degree of customer satisfaction (a CR of only 5%). This group has the potential to become service ambassadors and bring in many new clients. It is worth finding out where these clients work and offering their companies participation in the partner program, as well as offering the customers increased bonuses for bringing their friends.